This invention relates to a method of synchronising the delivery to a user of
content in a multi-modal interface and to a system which implements the method. In
particular, but not exclusively, the invention concerns a method and system for
synchronising delivery of visual and audible information in a multi-modal interface.

A multi-modal interface is a type of man-machine interface in which: (i) a user is
presented with information in two or more modes, for example visual information
presented on a display and audible information, which may be spoken, presented
audibly; and/or (ii) a user may provide input in two or more modes, for example a spoken
input and a physical (motor) input (such as operation of a keyboard, or the operation of a
cursor control device such as a mouse or track ball). Commonly, multi-modal interfaces
are multi-modal both for the presentation of information to a user and for the receipt of
information from a user. The present invention is applicable to multi-modal interfaces
which are multi-modal for the presentation of information to a user, whether or not the
interface is also multi-modal for the receipt of information from the user.

Some multi-modal interfaces are designed for use on self-contained
machines, such as desk-top computers, which contain a processor which operates the
multi-modal interface and which ensures that information to be presented visually and
information to be presented audibly are delivered to the user in the correct sequence and
with appropriate timings. So, for example, a voice prompt to "select your preferred hotel
from the list on the screen" is not provided until the processor knows that the
appropriate list of hotels has been displayed on the machine's display. Such control is a
trivial matter when the controlling process is on the same machine as the presentation
devices or when the process which runs the multi-modal interface effectively has direct
control of the systems which retrieve the stored information and present it to the user.
This applies whether or not the information which needs to be presented to the user is all
stored on the self-contained machine, since the controlling process could easily
pre-emptively download content files if they were not local.

In other multi-modal interfaces the controlling process and the presentation
devices are remote from each other, the latter not necessarily under the control of the
former. Often the information needed for each of the different output modes is stored
separately and different processes or communications paths are used for the retrieval of
the stored information. Additionally, the multi-modal interface may be provided by more
than one user terminal, for example a visual element may be provided by a computer or
PDA and the audible element may be provided over a telephone (fixed-line or mobile). In
all these situations it can be very difficult to ensure that the multi-modal interface
operates correctly. In particular, if information which is presented visually and that which
is presented audibly are presented in an unsynchronised manner, the user will become
confused and the interface will operate less well than a uni-modal interface.

The present invention seeks to address such problems.

WO99/44363 describes methods for synchronising sound and images in a real-time
multimedia communication, such as an audio-video telephone call, through a
network gateway, when the source and/or the destination of the audio signals, and
optionally also the video signals, is from and/or to separate audio and video
communication devices. It is explained that internal processing delays in the gateway
can give rise to a lack of synchronisation between sound and video signals passing
through the gateway. The gateway delay may be due, for example, to the need to
translate an audio signal from one standard used for transmission to the gateway input
to a different standard for onward transmission from the gateway output. It is explained
that it is usual to transcode the audio signals passing through a gateway, but less usual
to transcode video signals. This can give rise to the audio signals experiencing delays
which are not experienced by video signals which happen also to pass through the
gateway. It is further explained that the audio and video signals may become further
de-synchronised by the transit delay (i.e. propagation delay) between the gateway and the
audio and video devices at the receiver. The term "synchronisation delay" is used in this
reference to describe the total net difference between the audio and video signal delays,
including delays through the gateway. The expression "sensory output delay" is used to
define the time difference between the audio and video which the user perceives at the
receiving terminal. It is suggested that the variable sensory output delay may be
reduced if the magnitude of the actual delay is measured and then this measured value is
used to delay the video or audio signal appropriately. In order to achieve this it is
suggested that a user of the terminal gives feedback, for example using DTMF signalling,
to adjust the operation of the gateway until synchronisation is perceived by the user to
exist between the speech and video signals. Once this variable sensory output delay has
been determined, it is said to be possible to accommodate a delay, referred to as
intrinsic device transmission delay (commonly referred to as skew), which arises from
encoding delays within a device prior to transmission of the encoded signal to the
gateway. This accommodation may be accomplished by looping back the signals from
the separate devices to the gateway, then detecting any mismatch in the synchronisation
between the looped back signals (audio and video) from the separate devices at the
gateway caused by intrinsic device transmission delay and then adjusting a delay (the
variable device transmission delay) in the gateway so that the looped back signals at the
gateway are effectively synchronised. Optionally, a synchronisation marker is provided in
the audio and video signals to facilitate the automatic detection of any mismatch in the
synchronisation between the looped back signals. Overall, WO99/44363 relies very
largely on calibration of various terminal types and transmission link types, together with
calibration of the gateway itself, as well as the use of marker pulses in the data streams.
Moreover, the more practical versions of the synchronisation method all rely upon user
feedback to control the perceived synchronisation. While this may be a plausible
approach where there is effectively a need for lip synchronisation, for example when
the system is used in a video telephony link, it is harder to see how this might usefully be
used in a multi-modal interface situation.

In a first aspect the invention provides a method of synchronising the delivery to
a user of first information which is to be presented to the user via first output means of
a multi-modal interface and of second information which is to be presented to the user
via second output means of the multi-modal interface, the method comprising the steps
of:

i) estimating the total time needed to deliver the first information to the first
output means or to a store local to the first output means;

ii) estimating the total time needed to deliver the second information to the second
output means or to a store local to the second output means;

iii) using the estimates obtained in step i) or step ii) to determine whether the
presentation to the user of the first or second information needs to be
delayed to achieve a desired synchronism of presentation; and

iv) applying any delay determined in step iii) to achieve the desired synchronism of
presentation.

In a second aspect the invention provides a method of synchronising the delivery to a
user of first information which is to be presented to the user via a visual display of a
multi-modal interface and of second information which is to be presented to the user
over a visual or an audio interface of the multi-modal interface, the method comprising
the steps of:

i) estimating the total time needed to deliver the first information to the visual
display or to a store local to the visual display;

ii) estimating the total time needed to deliver the second information to the visual
or audio interface or to a store local to the visual or audio interface;

iii) using the estimates obtained in step i) or step ii) to determine whether the
presentation to the user of the first or second information needs to be
delayed to achieve a desired synchronism of presentation; and

iv) applying any delay determined in step iii) to achieve the desired synchronism of
presentation.

In a third aspect the invention provides a method of synchronising the delivery to a user
of first information which is to be presented to the user via a visual display and of second
information which is to be presented to the user over an audio interface, the method
comprising the steps of:

(i) estimating the total time needed to deliver the first information to the visual
display or to a store local to the visual display;

(ii) estimating the total time needed to deliver the second information to the audio
interface or to a store local to the audio interface; and

(iii) if the total time estimated in step (i) is more than that estimated in step (ii),
delaying the presentation of the second information to the user sufficiently to enable the
first information to be presented to the user before the second information is presented
to the user.
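
By way of illustration only, the comparison in step (iii) of the third aspect might be sketched as follows. The function name, the use of seconds as the unit and the optional safety margin are assumptions made for this sketch; they are not taken from the description.

```python
def audio_presentation_delay(visual_time_s: float,
                             audio_time_s: float,
                             margin_s: float = 0.0) -> float:
    """Return how long (in seconds) to hold back the audio presentation.

    visual_time_s: estimated total time to deliver the first (visual) information.
    audio_time_s:  estimated total time to deliver the second (audio) information.
    margin_s:      hypothetical extra allowance so that the visual content is
                   presented before the audio starts (an assumption of this sketch).
    """
    if visual_time_s > audio_time_s:
        # The visual content arrives later, so the audio is held back by the
        # difference between the two estimates, plus the optional margin.
        return (visual_time_s - audio_time_s) + margin_s
    # The visual content is expected first (or at the same time): no delay.
    return 0.0
```

With estimates of 2.5 s for the visual content and 0.5 s for the audio content, the audio would be held back by 2.0 s; with the estimates reversed no delay is applied.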

In a fourth aspect the invention provides a system of apparatus for the delivery to a
user of first information which is to be presented to the user via first output means of a
multi-modal interface and of second information which is to be presented to the user via
second output means of the multi-modal interface, the system including processing
means configured to:

estimate the total time needed to deliver the first information to the first output means
or to a store local to the first output means;

estimate the total time needed to deliver the second information to the second output
means or to a store local to the second output means;

use the estimates obtained to determine whether the presentation to the user of the
first or second information needs to be delayed to achieve a desired
synchronism of presentation; and

cause any delay determined to be necessary to be applied to achieve the desired
synchronism of presentation.

In a fifth aspect the invention provides a system of apparatus for the delivery to a user
of first information which is to be presented to the user via a visual display of a
multi-modal interface and of second information which is to be presented to the user over a
visual or an audio interface of the multi-modal interface, the system including processing
means configured to:

estimate the total time needed to deliver the first information to the visual display or to
a store local to the visual display;

estimate the total time needed to deliver the second information to the visual or audio
interface or to a store local to the visual or audio interface;

use the estimates obtained to determine whether the presentation to the user of the
first or second information needs to be delayed to achieve a desired
synchronism of presentation; and

cause any delay determined to be necessary to be applied to achieve the desired
synchronism of presentation.

The invention will now be described, by way of example only, with reference to
the accompanying drawings in which:

Figure 1 is a schematic diagram showing equipment to provide a multi-modal
interface;

Figure 2 shows schematically an alternative system of hardware to provide a
multi-modal interface; and

Figure 3 shows schematically a further system of hardware to provide a
multi-modal interface.

Specific Description 

Before describing and explaining the invention it is necessary for the reader to have some
understanding of the context of the invention. To this end, Figure 1 shows an example of
a system set up to provide a multi-modal interface. This will now be described as an
introduction to the invention. It should be noted however that the invention is not
restricted in its application to systems of the type shown in Figure 1.

Figure 1 shows a basic system on which the invention can be implemented. The system
includes a telephone 20 which is connected, in this case, over the public switched
telephone network (PSTN) to a VoiceXML based interactive voice response unit (IVR) 22.
The telephone 20 is co-located with a conventional computer 24 which includes a VDU 26
and a keyboard 28. The computer also includes a memory holding program code for an
HTML web browser 29, such as Netscape Navigator or Microsoft's Internet Explorer, and
a modem or network card (neither shown) through which the computer can access the
Internet (shown schematically as cloud 30) over communications link 32. The Internet 30
includes a server 34 which has a link 36 to other servers and computers in the Internet.
Both the IVR unit 22 and the Internet server 34 are connected to a further server 38
which we will term a synchronisation server. Note that the IVR unit 22, Internet server 34
and synchronisation server 38 may reside on the same hardware server or may be
distributed across several machines.

In the example shown a user has given a URL to the HTML browser, the process
of which is running on the computer 24, to direct the browser 29 to the web-site of the
user's bank. The user is interested in finding out what mortgage products are available,
how they compare one with another and which one is most likely to meet his needs. All
this information is theoretically available to the user using just the HTML browser 29,
although with such a uni-modal interface data entry can be quite time consuming. In
addition, navigating around the bank's web-site and then navigating between the various
layers of the mortgage section of the web-site can be particularly slow. It is also slow or
difficult to jump between different options within the mortgage section. This is
particularly true because mortgage products are introduced, modified and dropped fairly
rapidly in response to changing market conditions and in particular in response to the
offerings of competitors. So the web-site may be subject to fairly frequent design
changes, making familiarisation more difficult. In order to improve the ease of use of the
system there is provided a multi-modal interface through the provision of a dial-up IVR
facility 22 which is linked to the web-site hosted by the server 34. The link between the
IVR facility 22 and the server 34 is through the synchronisation manager 38.

The web-site can function conventionally for use with a conventional graphical
interface (such as that provided by Navigator or Internet Explorer when run on a
conventional personal computer and viewed through a conventional screen of reasonable
size and good resolution). However, users are offered the additional IVR facility 22 so
that they can have a multi-modal interface. The provision of such interfaces has been
shown to improve the effectiveness and efficiency of an Internet site and so is a desirable
adjunct to such a site.

The user begins a conventional Internet session by entering the URL of the web-site into
the HTML browser 29. The welcome page of the web-site may initially offer the option of
a multi-modal session, or this may only be offered after some security issues have been
dealt with and when the user has moved from the welcome page to a secure page after
some form of log-in.

In this example the web-site welcome page asks the user to activate a "button" on
screen (by moving the cursor of the graphical user interface (GUI) on to the button and
then "clicking" the relevant cursor control button on the pointing device or keyboard) if
they wish to use the multi-modal interface. Once this is done, a new page appears
showing the relevant telephone number to dial and giving a PIN (e.g. 007362436) and/or
control word (e.g. swordfish) which the user must speak when so prompted by the IVR
system 22. The combination of the PIN or control word and the access telephone number
will be unique to the particular Internet session in which the user is involved. The PIN or
password may be set to expire within five or ten minutes of being issued. If the user
delays setting up the multi-modal session to such an extent that the password has
expired, then the user needs to re-click on the button to generate another password
and/or PIN.

Alternatively this dialling information may be included in the first content page rather
than as a separate page.

Alternatively, if the user was required to log in to the web-site then the "click" may result
in the IVR system making an outbound call to the user at a pre-registered telephone
number.

In addition the welcome page may include client side components of the synchronisation
manager which are responsible for detecting user interface changes (e.g. changes in the
form field focus or value) in the visual browser and transmitting these to the
synchronisation manager, as well as receiving messages from the synchronisation
manager which contain instructions on how to influence the user interface (e.g. moving
to a particular form field, or changing a form field's value).

In addition, when providing this page the synchronisation manager provides the web
browser with a session identifier which will be used in all subsequent messages between
the synchronisation manager and the web browser or client components downloaded or
pre-installed on the web browser.

In the case where the user calls the IVR system, using the telephone 20, the user is
required to enter, at the voice prompt, the relevant associated items of information, which
will generally be the user's name plus the PIN or password (if only one of these is issued),
or the PIN and password (if both are issued by the system), in which case entry
of the user's name will in general not be needed (but may still be used). Although the
PIN, if used, could be entered using DTMF signalling, for example, it is preferred that
entry of all the relevant items of information be achieved with the user's voice. The IVR
system will typically offer confirmation of the entries made (e.g. by asking "Did you say
007362436?" or "Did you say swordfish?"), although this may not be necessary if the
confidence of recognition of all the items is high. Once the IVR system has received the
necessary data, plus confirmation, if required, it sends a call over the data link 40 to the
synchronisation manager 38 and provides the synchronisation manager 38 with the PIN,
password and/or user name as appropriate. The synchronisation manager 38 then
determines whether or not it has a record of a web session for which the data supplied
by the IVR system are appropriate. If the synchronisation manager 38 determines that
the identification data are appropriate it sends a message to the IVR system 22
informing it of the current voice dialogue to be run by the IVR and providing the IVR with
a session identifier which is used by the IVR application when making subsequent
information requests and data updates to the synchronisation manager. The initial
dialogue presented by the IVR system 22 may also provide voiced confirmation to the
user that the attempt to open the multi-modal interface has been successful. Preferably
the web server 38 also sends confirmation to the computer 24, typically via a new HTML
page, which is displayed on screen 26, so that the user knows that the attempt to open
the multi-modal interface has been successful.
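
The pairing of a telephone call with an existing web session, as described above, could be sketched along the following lines. This is a minimal illustration under stated assumptions: the class name `PairingRegistry`, its methods, the nine-digit PIN format and the five-minute lifetime are hypothetical choices for the sketch (the description says only that the PIN or password may expire within five or ten minutes), and a real system would also check the control word and/or user name.

```python
import secrets
import time


class PairingRegistry:
    """Hypothetical store that pairs a spoken PIN with a pending web session."""

    def __init__(self, ttl_s: float = 300.0, clock=time.monotonic):
        self._ttl_s = ttl_s      # PIN lifetime in seconds (assumed: five minutes)
        self._clock = clock      # injectable clock, so expiry can be tested
        self._pending = {}       # pin -> (web_session_id, expiry_time)

    def issue(self, web_session_id: str) -> str:
        """Create a PIN for a web session that has asked for a multi-modal session."""
        pin = f"{secrets.randbelow(10**9):09d}"
        while pin in self._pending:          # avoid reusing a PIN that is still pending
            pin = f"{secrets.randbelow(10**9):09d}"
        self._pending[pin] = (web_session_id, self._clock() + self._ttl_s)
        return pin

    def claim(self, pin: str):
        """Return the web session for a PIN supplied via the IVR, or None.

        A PIN can be claimed once only, and not after it has expired.
        """
        entry = self._pending.pop(pin, None)
        if entry is None:
            return None
        web_session_id, expires_at = entry
        if self._clock() > expires_at:
            return None
        return web_session_id
```

In this sketch the synchronisation manager would call `issue` when the user clicks the on-screen button and `claim` when the IVR system forwards the recognised PIN; a `None` result corresponds to the case where the user must click again to obtain a fresh PIN.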

At this point, either or both of the IVR system 22 and the web server 38 can be
used to give the user options for further courses of action. In general it is more effective
to give the user a visual display of the (main) options available, rather than the IVR
system 22 providing a voiced output listing the options. This is because visual display
makes possible a parallel or simultaneous display of all the relevant options and this is
easier for a user (particularly one new to the system) to deal with than the serial listing
of many options which a speech interface provides. However, an habituated user can be
expected to know the option which it is desired to select. In this case, with a suitably
configured IVR system, preferably with "barge in" (i.e. the ability for the system to
understand and respond to user inputs spoken over the prompts which are voiced by the
IVR system itself), and appropriately structured dialogues, the user can cut through many
levels of dialogue or many layers (pages) of a visual display. So for example, the user
may be given an open question as an initial prompt, such as "how can we help?" or
"what products are you interested in?". In this example an habituated user might
respond to such a prompt with "fixed-rate, flexible mortgages". The IVR system
recognises the three items of information in this input and this forces the dialogue of the
IVR system to change to the dialogue page which concerns fixed-rate flexible mortgages.
The IVR system requests this new dialogue page via the synchronisation server 38 using
data link 40. Also, if the fact that the dialogue is at a particular new page does not
already imply "fixed-rate, flexible mortgages", any additional information contained in that
statement is also sent by the IVR system to the synchronisation server 38 as part of the
request.

The synchronisation server 38 uses the session identifier to locate the application group
that the requesting IVR application belongs to and, using the mapping means, converts
the requested voice dialogue page to the appropriate HTML page to be displayed by the
web browser. A message is then sent to the web browser 29 instructing it to load the
HTML page corresponding to fixed-rate mortgages from the web server 34 via the
synchronisation manager 38 using data link 20. In this way both the voice browser and
the web browser are kept in synchronisation, "displaying" the correct page.

The fixed-rate mortgage visual and voice pages may include a form containing one or
more input fields, for example drop down boxes, check boxes, radio buttons or voice
menus, voice grammars or DTMF grammars. The voice browser and the visual browser
execute their respective user interfaces as described by the HTML or VoiceXML page. In
the case of the visual browser this means the user may change the value of any of the
input fields, either by selecting from e.g. the drop down list or by typing into a text box. For
the voice browser the user is typically led sequentially through each input field in an
order determined by the application developer, although it is also possible that the voice
page is a mixed initiative page allowing the user to fill in input fields in any order.



The user selects an input field either explicitly, e.g. by clicking in a text box, or implicitly,
as in the case of the voice dialogue stepping to the next input field according to the
sequence determined by the application developer. Then the client code components of
the synchronisation manager send messages to the synchronisation manager indicating
that the input field currently in focus has changed. This may or may not cause the focus to
be altered in the other browsers, depending on the configuration of the synchronisation
manager. If the focus needs to change in another browser then a message is sent from
the synchronisation manager to the client component in the other browser to indicate
that the focus should be changed. For example, if the voice dialogue asks the question "How
much do you want to borrow?" then the voice dialogue will indicate that the voice focus is
currently on the capital amount field. If so configured, the synchronisation manager
will map this focus to the corresponding input element in the visual browser and will send
a message to the visual browser to set the focus to the capital amount field within the
HTML page. This may produce a visible change in the user interface, for example the
background colour of the input element changing to indicate that this element now has
focus. If the user then responds "80,000 pounds" to the voice dialogue then the input is
detected by the client component resident in the voice browser and transmitted to the
synchronisation manager. The synchronisation manager determines whether there is a
corresponding input element in the HTML page, performs any conversion on the value
(e.g. 80,000 pounds may correspond to index 3 of a drop down list of options 50,000,
60,000, 70,000, 80,000) and sends a message to the client component in the HTML
browser instructing it to change the HTML input field appropriately. In parallel the user
may also have clicked on the check box in the HTML page indicating that a repayment
mortgage is preferred. This change in value of the input field is transmitted via the
synchronisation manager to the voice browser client components, which modify the value
of the voice dialogue field corresponding to mortgage type such that the voice dialogue will
now skip the question "Do you want a repayment mortgage?" since this has already been
answered by the user through the HTML interface. Hence it can be seen that the
combination of the client side components and the synchronisation manager ensures that user
inputs that affect the values of input elements of a form within an HTML or VoiceXML
page are kept in synchronisation.
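
The value conversion described above (a spoken "80,000 pounds" becoming index 3 of a drop down list) might be sketched as follows. The field names, the element identifier `loanAmount`, the shape of the returned message and the digit-stripping conversion are all assumptions introduced for illustration; the description does not specify how the mapping means is implemented.

```python
import re

# Options of the hypothetical drop down list, in display order (zero-based index).
LOAN_OPTIONS = [50_000, 60_000, 70_000, 80_000]


def spoken_amount_to_index(utterance, options=LOAN_OPTIONS):
    """Convert a recognised amount such as '80,000 pounds' to a list index."""
    digits = re.sub(r"[^0-9]", "", utterance)   # keep the digits only
    if not digits:
        return None
    amount = int(digits)
    return options.index(amount) if amount in options else None


# Hypothetical map: voice dialogue field -> (HTML element id, value converter).
FIELD_MAP = {
    "capital_amount": ("loanAmount", spoken_amount_to_index),
}


def voice_update_to_html(voice_field, spoken_value):
    """Build the update message for the HTML browser's client component.

    Returns None when there is no corresponding HTML input element or the
    spoken value cannot be converted.
    """
    mapping = FIELD_MAP.get(voice_field)
    if mapping is None:
        return None
    element_id, convert = mapping
    index = convert(spoken_value)
    if index is None:
        return None
    return {"element": element_id, "selectedIndex": index}
```

A call such as `voice_update_to_html("capital_amount", "80,000 pounds")` yields an instruction to select index 3 of the `loanAmount` element, matching the example in the text.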

More typically, the fixed-line telephone 20 of the Figure 1 arrangement will be replaced
with a mobile telephone, smart phone or PDA with a cellular radio interface (GSM, GPRS
or UMTS). Similarly, the conventional computer 24 with a wired interface will be replaced
with a lap top or palm top computer with a wired or wireless (infra red, Bluetooth, or
cellular) interface. Examples of such alternative configurations are shown in Figures 2
and 3.

In Figure 2 a laptop computer 44 runs an HTML browser process 29, the GUI of which is
visible on screen 26. The laptop is connected via a wireless data link 32 (such as a
wireless LAN) to synchronisation server 38. The user of the laptop 44 also has a cellular
telephone 50 which is connected via a GSM link 46 (of a cellular network) to a VoiceXML
gateway 52. The gateway 52 is connected via a VXML channel 54 to the synchronisation
server 38. The synchronisation server 38 is linked to a content and application server 58
from which content and application programs may be downloaded to either the mobile
telephone 50 or the laptop 44. The multi-modal interface process which is controlled by
the synchronisation server 38 makes use of a blackboard (data store) 202 in the process
of passing data updates between the various application programs (e.g. the HTML
browser 29 and the VoiceXML browser of the gateway 52) which make up the interface.
The map file 203 is used by the synchronisation server 38 to ensure appropriate
synchronisation between the browsers.

In Figure 3 a smart phone 60 (or PDA with an appropriate mobile-telephony interface)
replaces the separate display and telephone of the examples of Figures 1 and 2. The
smart phone 60 runs an HTML browser 29 and an audio client 64. These communicate
via a wireless link with a synchronisation server 38.

The invention concerns techniques for ensuring that the visual components of the
multi-modal interface, which will be displayed by means of the VDU 26, are available to the
user at an appropriate time with respect to the audio components, which are provided
over the telephone 20.

Various factors may need to be taken into account if the various information
components are to be delivered with appropriate timing. Many of these are system
specific and will not be considered in detail here. Examples are:

how long a browser takes to render (visually or otherwise) the content;

whether the content is dynamically generated (and how long generation
takes - perhaps there is database access that slows it down);

how error-prone the connection is (possibly necessitating unforeseen
resend attempts);

if a network-based voice browser is being used, it may well be supporting
multiple users (loading the CPU more), in which case it will be slower to
'render' the pages once they have arrived; and

some types of content (Java applets, for example) may, once delivered, take
significant time to 'start up' or display.

There are three generic factors which more often fall to be considered: network
latency, network bandwidth and total document size. Methods of calculating
these will now be described in turn below.
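
As a rough indication of how the three generic factors might be combined into a single delivery-time estimate, consider the sketch below. The simple additive model (one-way latency plus transfer time) is an assumption made for illustration and is not stated in the description; it ignores the system-specific factors listed earlier, such as rendering time and resend attempts.

```python
def estimate_delivery_time(latency_s: float,
                           bandwidth_bps: float,
                           size_bytes: int) -> float:
    """Estimate the time in seconds to deliver a document to a client.

    latency_s:     estimated one-way network latency, in seconds.
    bandwidth_bps: estimated network bandwidth, in bits per second.
    size_bytes:    total document size, in bytes.
    """
    if bandwidth_bps <= 0:
        raise ValueError("bandwidth must be positive")
    transfer_time_s = (size_bytes * 8) / bandwidth_bps   # bytes -> bits
    return latency_s + transfer_time_s
```

For instance, a 70,000 byte page sent over a 56,000 bit/s link with 0.2 s latency would be estimated at roughly 10.2 s under this model.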

The latency is a measure of the total time taken for data to travel from one part of the
network to another. Usually, this will be quite small, but it is potentially of the order of
seconds. Since clients may be located on different networks, this becomes an important
consideration. A method is suggested for the estimation of network latency for each
client, requiring no additional client software. This method also allows the difference
between server and client clocks to be estimated. Once this is known, client requests to
the server can be more accurately time stamped, thereby giving a revised estimate of the
latency.

In the following description, times without a prime (') are server times, and times with a
prime are equivalent client times. Thus, the client's clock reads T2' when the server's
clock reads T2.

1. The client opens a connection to the server by requesting a specific page that is
generated by a servlet;

2. At time T1 the server returns a very small document to the client using the open
connection;

3. Some time later, at time T2, the client receives the document and immediately sends
it back with its own current time T2' attached;
4. The packet arrives back at the server at time T3.

The server can then calculate, at leisure, the approximate network latency and the
approximate difference in the clock times:

$$\text{latency} \approx \tfrac{1}{2}(T_3 - T_1)$$

and:

$$
\begin{aligned}
T_2' + \text{adjustment} &= T_2 \\
\text{adjustment} &= T_2 - T_2' \\
\text{adjustment} &\approx (T_1 + \text{latency}) - T_2' \\
\text{adjustment} &\approx T_1 + \tfrac{1}{2}(T_3 - T_1) - T_2' \\
\text{adjustment} &\approx \tfrac{1}{2}(T_3 + T_1) - T_2'
\end{aligned}
$$

When the client makes future requests to the server, it can time-stamp them T' +
adjustment, which will approximate to T, the server's time when the client's clock reads
T'.
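
The calculation above can be illustrated with a short sketch. The function and parameter names are hypothetical; the arithmetic follows the description, taking the latency as half the round-trip time and assuming, as the derivation does, that the outward and return journeys take equal time.

```python
def estimate_latency_and_adjustment(t1: float, t3: float, t2_client: float):
    """Estimate one-way latency and the client-to-server clock adjustment.

    t1:        server time at which the small document was sent (T1).
    t3:        server time at which the document arrived back (T3).
    t2_client: client clock reading attached to the returned document (T2').
    """
    latency = 0.5 * (t3 - t1)                 # latency ~ 1/2 (T3 - T1)
    adjustment = 0.5 * (t3 + t1) - t2_client  # adjustment ~ 1/2 (T3 + T1) - T2'
    return latency, adjustment


def to_server_time(client_time: float, adjustment: float) -> float:
    """Convert a client clock reading T' to approximate server time T."""
    return client_time + adjustment
```

For example, if the server sends at T1 = 100.0 s, receives the reply at T3 = 100.4 s and the client reported T2' = 250.0 s, the latency estimate is about 0.2 s and the adjustment about -149.8 s, so a later client reading of 260.0 s corresponds to a server time of about 110.2 s.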

Implementation of Network Latency Estimation 
Three methods are proposed, all based upon HTML browsers, one of which can be used
with a stand-alone Java-based browser, and one of which estimates latency but does not
allow the difference between the client's and server's clocks to be estimated and the
latency re-estimated.

Note that for all three systems, the described implementation uses dedicated URLs to
perform the latency estimation. It is reasonable to assume that the server could return
appropriate synchronisation code in response to any request made by the client, before
returning the actual page requested, thereby removing the need for a specialised URL.

(i) HTML-based method
For HTML browsers that do not support the use of Java applets or JavaScript, an
HTML-based method is suggested. The method, which does not allow the clock difference
measurement, is as follows:

1. The client makes a GET request to the server, indicating that it is ready to cooperate
in estimating the latency (for example,
http://www.myserver.com/servlet/CalculateLatency).

2. The server, at leisure, returns an HTML document that immediately loads another
HTML document from the same server. For example:

<html><meta http-equiv="refresh" content="0; URL=http://www.myserver.com/servlet/CalculateLatency"></html>

3. The server, again at leisure, can then estimate the latency of the connection based
upon the time between sending the first document and receiving the request for the
second.

(ii) HTML- and JavaScript-based method

For HTML browsers that support JavaScript but not Java, an HTML- and JavaScript-based
method can be used instead. This allows the approximate difference between the client's
and server's clocks to be calculated, thereby enabling the latency estimate to be updated
on each subsequent request to the server. The method is as follows:

1. The client makes a GET request to the server, indicating that it is ready to cooperate 
in estimating the latency. 

2. The server, at leisure, returns an HTML document containing JavaScript that
immediately loads another HTML document from the same server. For example:

<html><script language="JavaScript">
<!--
self.location =
"http://www.myserver.com/servlet/CalculateLatency?stime=1002718472800&ctime="
+ (new Date()).getTime();
//-->
</script></html>

3. The server, again at leisure, can then estimate the latency of the connection based
upon the time between sending the first document and receiving the request for the
second.

4. The server can also estimate the difference between the client's and server's clocks 
using the latency (all times are by the server's clock unless otherwise stated): 

o T1 is the time at which the server sends the first document;
o T2 is the time at which the client receives the response and begins loading the
second document;
o T3 is the time at which the server receives the request for the second document;
o T2' is the time by the client's clock at the same instant that the server's clock
reads T2.
$$
\begin{aligned}
T_2' + \text{adjustment} &= T_2 \\
\text{adjustment} &\approx T_1 + \text{latency} - T_2' \\
&= T_1 + \tfrac{1}{2}(T_3 - T_1) - T_2' \\
&= \tfrac{1}{2}(T_3 + T_1) - T_2'
\end{aligned}
$$

and T3, T1 and T2' are known by the server. Knowledge of the clocks' difference
means that when the client makes future requests to the server, it can time-stamp
them T' + adjustment, which will approximate to T, the server's time when the
client's clock reads T', thus enabling the latency to be recalculated. This will provide
an effective re-estimation under the condition that the latency may have changed,
but is always the same in both directions on the channel.
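
One plausible reading of the re-estimation step is sketched below in Python. The source does not spell out this arithmetic, so the function and parameter names are hypothetical and the calculation is an assumption: the adjusted client time-stamp is treated as the server-clock time at which the request was sent, and the latency is the gap between that and the server's receipt time. All times are assumed to be in milliseconds.

```python
def reestimate_latency(
    client_stamp: float, adjustment: float, server_receive_time: float
) -> float:
    """Re-estimate one-way latency from a time-stamped client request.

    client_stamp        -- client clock reading T' when the request was sent
    adjustment          -- previously estimated clock adjustment
    server_receive_time -- server clock reading when the request arrived
    """
    # T' + adjustment approximates the server's time when the request left.
    approx_server_send_time = client_stamp + adjustment
    return server_receive_time - approx_server_send_time


# Illustrative values only: client clock 4 s ahead, request takes 150 ms.
new_latency = reestimate_latency(6000.0, -4000.0, 2150.0)
```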

(iii) Java-based method

This exploits Java's greater control over POST requests to the server, opening what is in
essence the second connection before it is actually needed. This also allows the
difference between the two clocks to be estimated, and the process is:

1. The client makes a POST request to the server, but does not send any of the POST
information yet.

2. The client makes a GET request to the server, indicating that it is ready to cooperate
in estimating the latency.

3. The server, at leisure, returns a text document containing its current time.

4. This is immediately parsed by the client, which then straight away completes its 
previously-opened POST by sending the client's current time and the time received 
from the server. 

5. The server, again at leisure, can then estimate the latency of the connection based
upon the time between sending the first document and receiving the request for the
second.

6. The server also estimates the difference between the client's and server's clocks 
using the latency as explained above. 

While all of these techniques focus on the latency of the channel for the visual
content, similar techniques can be used with VoiceXML for speech content. Where
VoiceXML is not used, other techniques can be adopted to obtain the relevant information.
Where two channels are known to have similar characteristics (at least in so far as
latency is concerned), the latency calculated for one of the channels may be used to
delay content on the other channel. Similar channels might be associated with two modes
in the same multi-modal session, or even two modes in different sessions. This
extension proves useful when clients are known to share similar channel characteristics,
but one client is not capable of co-operating in the latency estimation procedure: in this
case, the latency calculated through the more capable browser (or other client) can be
used to determine the treatment (e.g. delay or not relative to another channel)
appropriate for the channel with the less capable browser (or other client).



The bandwidth of each network is calculated by the server, which records the total time
taken to send a file to the client, then uses that and the size of the file to estimate the
average bandwidth. Since multiple downloads can occur simultaneously, the server must
be aware of downloads occurring at the same time as the one being measured. All
downloads must be through the server for an accurate estimation of the bandwidth.
Since the server is aware of what files it is uploading to what client, and when each
upload starts and stops, the effective upload time can be calculated. Take the following
example of four files being uploaded to the same client:
[Figure: timeline of four file uploads (FILE 1 to FILE 4) to the same client, with the
time axis divided into consecutive intervals a to g.]


The horizontal axis represents time, and each arrow indicates the time period in which
that file downloads. Taking FILE 1 as an example, the total upload time is a + b + c + d
+ e + f. However, at various times, more than one file is being uploaded to the client at
once. Making the assumption that all uploads have the same priority and therefore the
same approximate proportion of total available bandwidth, the effective upload time for
FILE 1 is a + b/2 + c/3 + d/4 + e/3 + f/2. A similar approach is used for all uploads,
yielding an approximation of the bandwidth.

Implementation of Network Bandwidth Estimation

A Java-based system has been developed to estimate the bandwidth of the network. All 
requests to the server are performed via servlets: simple requests are wrapped in a 
small-footprint servlet; servlet requests have additional logic. The system works by 
creating a single instance of a class that maintains information on current and historical 
downloads. Every request to the server causes (by virtue of the aforementioned
wrapper or additional logic) a method call on this object that logs the download. When
the download completes, a similar call is made to cause the bandwidth to be 
recalculated. 

In order to follow any changes in bandwidth, a limited-size history is maintained so that
only the last N downloads' bandwidth calculations are included in the overall bandwidth
estimation, which is essentially a running average.
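
A limited-size history of this kind can be sketched in Python as follows. This is an illustration under stated assumptions, not the Java system described: the class and method names are hypothetical, and the average is taken to be a plain mean of the retained samples.

```python
from collections import deque
from typing import Optional


class BandwidthHistory:
    """Keeps the bandwidth estimates of the last N downloads."""

    def __init__(self, max_entries: int) -> None:
        # A deque with maxlen silently discards the oldest entry when full.
        self._samples = deque(maxlen=max_entries)

    def add(self, bandwidth: float) -> None:
        self._samples.append(bandwidth)

    def average(self) -> Optional[float]:
        """Mean of the retained samples, or None before any download."""
        if not self._samples:
            return None
        return sum(self._samples) / len(self._samples)


# Illustrative values only: with N = 3, the first sample is forgotten.
history = BandwidthHistory(max_entries=3)
for sample in (10.0, 20.0, 30.0, 40.0):
    history.add(sample)
```

A small N tracks changes in bandwidth quickly but is noisier; a large N is smoother but slower to react.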

Before any requests are accepted, various data have to be prepared, including:
o A lookup table to associate execution thread IDs with download start times.
o A variable-size array to store information about each download start/finish event.

When a call is made to indicate that a download is starting, the sequence of events is:

1. Get the current time in milliseconds. 

2. Append an entry to the array to store the time and the [increased] number of 
downloads. 

3. Add the execution thread's ID to the lookup table so its start time can be determined
when its download has finished.

When a call is made to indicate that a download is finishing, the sequence of events is: 

1. Get the current time in milliseconds. 

2. Append an entry to the array to store the time and the [decreased] number of 
downloads. 

3. Get the start time from the lookup table, based upon the thread's ID.

4. Calculate the bandwidth based upon this single download (see below for details).

5. Update the running average of the bandwidth.
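
The start and finish sequences above can be sketched in Python. This is a minimal illustration and not the Java class described: the names are hypothetical, a dictionary stands in for the lookup table, a list stands in for the variable-size array, and steps 4 and 5 of the finish sequence (the bandwidth calculation and the running average) are deliberately left out.

```python
import time
from typing import Dict, Hashable, List, Optional, Tuple


class DownloadLog:
    """Logs download start/finish events with the concurrent-download count."""

    def __init__(self) -> None:
        self.events: List[Tuple[int, int]] = []       # (time_ms, concurrent)
        self.start_times: Dict[Hashable, int] = {}    # thread ID -> start time
        self._active = 0

    @staticmethod
    def _now_ms() -> int:
        return int(time.time() * 1000)

    def download_started(self, thread_id: Hashable,
                         now_ms: Optional[int] = None) -> None:
        now = self._now_ms() if now_ms is None else now_ms
        self._active += 1
        self.events.append((now, self._active))       # increased count
        self.start_times[thread_id] = now

    def download_finished(self, thread_id: Hashable,
                          now_ms: Optional[int] = None) -> Tuple[int, int]:
        """Record the finish and return (start_ms, finish_ms) for this download."""
        now = self._now_ms() if now_ms is None else now_ms
        self._active -= 1
        self.events.append((now, self._active))       # decreased count
        return self.start_times.pop(thread_id), now


# Illustrative values only; explicit times make the example repeatable.
log = DownloadLog()
log.download_started("thread-1", 100)
log.download_started("thread-2", 200)
span = log.download_finished("thread-2", 300)
```

A real multi-threaded server would also need to guard these methods with a lock; that is omitted here for brevity.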
It is perhaps easiest to explain how the bandwidth is calculated for a single download by
means of an example. The following table represents the array containing the download
start/finish events:







Entry  Time           Number of Concurrent Downloads  Comments
a      1002718472800  1                               Download 1 started
b      1002718473000  2                               Download 2 started
c      1002718473900  1                               Download 2 finished
d      1002718474000  2                               Download 3 started
e      1002718475000  3                               Download 4 started
f      1002718475600  2                               Download 3 finished
g      1002718475700  3                               Download 5 started
h      1002718475850  2                               Download 1 finished
i      1002718478000  3                               Download 6 started



The lone-download time (i.e., the time it would have taken to download if there
were no concurrent downloads) for Download 1 is given by the sum of the times between
successive entries divided by the number of downloads in progress at that time. In other
words, this is:

$$
\begin{aligned}
&(1002718473000 - 1002718472800) / 1 \\
{}+{} &(1002718473900 - 1002718473000) / 2 \\
{}+{} &(1002718474000 - 1002718473900) / 1 \\
{}+{} &(1002718475000 - 1002718474000) / 2 \\
{}+{} &(1002718475600 - 1002718475000) / 3 \\
{}+{} &(1002718475700 - 1002718475600) / 2 \\
{}+{} &(1002718475850 - 1002718475700) / 3 \\
{}={} &200 + 450 + 100 + 500 + 200 + 50 + 50 \\
{}={} &1550\ \text{ms}
\end{aligned}
$$

The size of the document being downloaded from the server to the client can either be
retrieved from the server (by, for example, using the getContentLength() method of
Java's URLConnection class) or, for dynamic documents, can be calculated by storing the
document being generated and writing it out once its length is known.

Thus, the effective bandwidth for the duration of this document's download can be
calculated by dividing the size by the lone-download time.
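
The worked example for Download 1 can be reproduced with a short Python sketch. The function names are hypothetical, the event list is copied from the table above, and the document size of 31,000 bytes is an invented figure used only to show the final division.

```python
from typing import List, Tuple

# (time in ms, number of concurrent downloads), entries a to i of the table.
EVENTS: List[Tuple[int, int]] = [
    (1002718472800, 1), (1002718473000, 2), (1002718473900, 1),
    (1002718474000, 2), (1002718475000, 3), (1002718475600, 2),
    (1002718475700, 3), (1002718475850, 2), (1002718478000, 3),
]


def lone_download_time(events: List[Tuple[int, int]],
                       start_ms: int, finish_ms: int) -> float:
    """Sum each interval inside [start_ms, finish_ms] divided by the number
    of downloads in progress during that interval."""
    total = 0.0
    for (t0, concurrent), (t1, _) in zip(events, events[1:]):
        if t0 >= start_ms and t1 <= finish_ms:
            total += (t1 - t0) / concurrent
    return total


def effective_bandwidth(size_bytes: int, lone_ms: float) -> float:
    """Bandwidth in bytes per second for a single download."""
    return size_bytes * 1000 / lone_ms


# Download 1 starts at entry a and finishes at entry h.
lone_ms = lone_download_time(EVENTS, 1002718472800, 1002718475850)
bandwidth = effective_bandwidth(31000, lone_ms)  # size is hypothetical
```

The sketch relies on the same equal-share assumption as the text: every concurrent download is taken to receive the same proportion of the available bandwidth.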

Total Document Size
Each document being uploaded from the server to the client is parsed to determine which
other documents (images, grammars, or frames, for example) are also automatically
uploaded at the same time. Knowledge of the client's caching policy is required (including
what inline content is automatically downloaded and which not), as is the initial state of
its cache (most probably empty). As the server finds these additional documents, it
maintains a total of the amount of data that must be sent to the client, based upon its
knowledge of the client's cache. For example, a file that is not in the cache or which has
expired will have its size added to the total; but one that is present in the cache but has
not expired will not have its size added. As necessary, these other documents are also
parsed recursively. An example of this might be when a frameset is being downloaded:
each enclosed frame may also need to be downloaded, along with its images and sound
files, etc.

Implementation of Total Document Size 
Once the approximate bandwidth between server and client is known, the time it will
take to download the document and its sub-documents (including images, grammar files,
non-streaming audio or video files, and child frames and their sub-documents) needs to
be calculated. As already explained, it is necessary to be able to guarantee one of two
things to yield a successful estimate of total download time: (a) that the initial document
has no sub-documents; or (b) that the caching policy of the client browser is known, as
well as the initial state of the cache. In the first case, the total document size is the same
as the initial document's size; the calculation is therefore trivial. In the second case, the
server is able to determine which of the sub-documents will need to be downloaded by
the client and which will not; the calculation is essentially quite straightforward, as the
following steps that the server will need to take demonstrate:

1. Parse the initial document and determine what sub-documents it contains. Depending
upon the complexity of the document, this task could range from very easy (as with a
VoiceXML document containing grammars that are always - i.e., not dynamically -
loaded) to very difficult (as with an applet that downloads various images, sound
files, and Java classes; or as with an HTML document containing JavaScript that
dynamically writes some of the HTML).

2. For each of the sub-documents, determine from its type whether it is inherently
stand-alone (such as an image) or whether it, like the initial document, contains other
sub-documents to download (such as a frame within the initial document's frameset),
then parse that according to step 1 as necessary.

3. Repeat steps 1 and 2 until a list of all documents has been constructed.

4. Based upon prior knowledge of the client's cache and caching policy, remove from
the list all documents that the client will not need to download (because they are
cached and are not expired).

5. Calculate the total download size by summing the individual size of each document
still in the list.
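
Steps 1 to 5 can be sketched in Python as below. This is an illustration only: real parsing is replaced by a hypothetical in-memory table mapping each document name to its size and its sub-documents, and the client's cache is modelled as a set of names that are cached and not expired. All file names and sizes are invented.

```python
from typing import Dict, List, Set, Tuple

# name -> (size in bytes, names of directly referenced sub-documents)
DocumentTable = Dict[str, Tuple[int, List[str]]]


def total_download_size(root: str, documents: DocumentTable,
                        fresh_in_cache: Set[str]) -> int:
    """Total bytes the client must fetch for `root` and its sub-documents."""
    # Steps 1-3: walk the document tree until every document is listed once.
    listed: Set[str] = set()
    pending = [root]
    while pending:
        name = pending.pop()
        if name in listed:
            continue
        listed.add(name)
        pending.extend(documents[name][1])
    # Step 4: drop documents the client already holds and need not fetch.
    to_fetch = listed - fresh_in_cache
    # Step 5: sum the sizes of what remains.
    return sum(documents[name][0] for name in to_fetch)


# Hypothetical frameset with two frames sharing one cached image.
DOCS: DocumentTable = {
    "frameset.html": (500, ["a.html", "b.html"]),
    "a.html": (1000, ["logo.png"]),
    "b.html": (2000, ["logo.png", "beep.wav"]),
    "logo.png": (4000, []),
    "beep.wav": (8000, []),
}
total = total_download_size("frameset.html", DOCS, {"logo.png"})
```

Tracking the listed names in a set means a document referenced from several frames is counted only once.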

For the purposes of this implementation, only relatively simple documents will be parsed
in step 1. Dynamic documents are not covered by this implementation; however, it is
clear that a full browser would be necessary to parse some of the more complex
documents. A possible implementation would be a proxy client, local to the server, that
sits between the server and the client. This would mirror the actual client, downloading
pages from the server and passing them as a proxy to the remote client. The proxy client
would have an identical caching policy to the actual client (which would need its cache
aligned with the proxy's, most likely by clearing it) and would be in direct link with the
server. In this way, the server does not need to calculate the amount of data that will be
downloaded to the client, instead delivering it rapidly to the proxy client and summing
the amount of data it delivers.

The overall implementation described in this document also covers the trivial,
no-sub-document case mentioned above.

Once these three factors (network latency, network bandwidth and total document size)
are known (along with implementation-specific factors as mentioned above), an estimate
of the total time to deliver the content can be calculated for each client based upon its
own network characteristics. The difference between the longest of these download
times and each of the others can then be used as a delay. For example, if the longest of
the clients' download times is 10 seconds, that client's content will be delivered as quickly
as possible (i.e., with no delay). If another client's download time is 6 seconds, that
client's content can be delayed (by the server) by 4 seconds to ensure that it finishes
downloading at the same time as the first client. Of course, in some situations it may be
known that network latency, for example, is the dominant factor (e.g. where network
bandwidth is high and total document size is not significant). In such a situation
network latency may be the only factor which needs to be taken into account when
estimating the total delivery time.
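
The delay calculation can be sketched in Python. The text does not give a formula for combining the three factors, so the estimate used here (latency plus size divided by bandwidth) is an assumption, as are the function names and all the figures, which are chosen to reproduce the 10-second and 6-second example.

```python
from typing import Dict


def estimated_delivery_time(latency_s: float, bandwidth_bytes_per_s: float,
                            size_bytes: float) -> float:
    """Assumed model: one-way latency plus transfer time."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


def delivery_delays(download_times: Dict[str, float]) -> Dict[str, float]:
    """Delay each client by its shortfall against the slowest client."""
    longest = max(download_times.values())
    return {client: longest - t for client, t in download_times.items()}


# Hypothetical clients: the visual channel needs 10 s, the audio channel 6 s.
times = {
    "visual": estimated_delivery_time(0.5, 2000.0, 19000.0),
    "audio": estimated_delivery_time(1.0, 2000.0, 10000.0),
}
delays = delivery_delays(times)
```

Where latency is known to dominate, the same `delivery_delays` function can be fed the latency estimates alone.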
Usually the estimation of download times and any addition of delay will be performed
automatically under the control of the multi-modal interface process.

Alternatives to Delaying Content Delivery
Instead of simply delaying content, another, more user-friendly approach is to deliver a
pre-content "page" to all but the longest-download client, saying roughly how long the
full content will take to download or specifying the time, T, at which the content should
be delivered (using the timing estimates from above).

Another approach is to delay content only when it absolutely has to be delivered at the
same time. An example might be when an audio client says "please speak one of the
options on your screen"; it must not say this before the visual client has finished loading.

A further approach, only possible in some systems (such as that described in GB 0108044.9,
Agent's Ref. A26127), is to use an event mechanism whereby a client sends a
message to the server, then waits for a response telling it to "display" (in whatever way)
the content. The server waits for all clients (or an appropriate, minimal, or predetermined
selection of clients) to indicate that they have finished loading before informing the
clients that they can commence "display". (In an HTML browser, this could be achieved
by using the document's/frameset's onload event, then loading a specific URL into a
JavaScript Image object and using that object's onload event to activate the page. The
server would not reply to the 'image' URL request until the right selection of clients were
ready.)
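
The server-side bookkeeping for such an event mechanism might look like the following Python sketch. It is not taken from the referenced application: the class and method names are hypothetical, and the sketch only tracks readiness, leaving out the held-open HTTP responses that a real server would release once the gate opens.

```python
from typing import Hashable, Iterable, Set


class DisplayGate:
    """Tracks which of the required clients have reported 'finished loading'."""

    def __init__(self, required_clients: Iterable[Hashable]) -> None:
        self._required: Set[Hashable] = set(required_clients)
        self._ready: Set[Hashable] = set()

    def client_loaded(self, client_id: Hashable) -> bool:
        """Record a client's 'loaded' message.

        Returns True once every required client has reported, which is the
        point at which the server may tell the clients to commence display.
        """
        self._ready.add(client_id)
        return self._required <= self._ready


# Hypothetical session with one HTML browser and one voice browser.
gate = DisplayGate({"html-browser", "voice-browser"})
first = gate.client_loaded("html-browser")
second = gate.client_loaded("voice-browser")
```

Passing a subset of the session's clients as `required_clients` corresponds to waiting for only a predetermined selection of clients.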

The foregoing description has primarily focussed on systems in which audible content is
delayed so that it does not arrive before the related visual content. The delay may be
chosen so that the audible content is delivered at the same moment as the visual
information or, as is more usual, the delay may simply be such that the visual information
has been displayed and is visible to the user before (often just before) the audible
content is delivered. Of course this latter may require that it is the visual content which
is delayed in order to be presented to the user just before or simultaneously with the
audible content, where the visual content would otherwise arrive too soon before the
audible content. The application developer may delay other content in the same way. In
particular, visual content from two different sources or systems may need to be
synchronised, so that the correction of timing to achieve synchronisation may be of one
quantum of visual content with respect to another quantum of visual content (with or
without any related audible content). Another example where synchronisation could be of
value, and hence where the invention could be applied, is in synchronising WML and
HTML (for example in using a WAP phone to control an HTML browser in a shop
window, so the HTML browser is effectively improving the graphical capabilities of the
WAP phone). Another use case is synchronising two voice browsers, each in a different
language, so that two people of different nationalities could work together to complete a
form. A further example is the synchronisation of a voice interface (e.g. a voice browser)
with a tactile (or haptic) interface such as a Braille terminal, so that a blind person can
benefit from multi-modality, much as a sighted person does when using visual and
audible interfaces.

The application developer may also specify the degree of synchronisation by indicating
the maximum allowable delay between the arrival of different content for it to be
considered simultaneous. The described process can be applied to any combination of
any number of modes, and it is the application developer's decision which of these are
delayed to arrive simultaneously or synchronously.
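
A maximum allowable delay of this kind amounts to a simple tolerance check, sketched here in Python. The function name and the figures are hypothetical, and arrival times are assumed to be expressed in seconds on a common clock.

```python
from typing import Iterable


def is_simultaneous(arrival_times: Iterable[float],
                    max_allowed_delay: float) -> bool:
    """True if the earliest and latest arrivals differ by no more than
    the developer's maximum allowable delay."""
    times = list(arrival_times)
    return max(times) - min(times) <= max_allowed_delay


# Hypothetical arrivals for three modes, with a half-second tolerance.
within_tolerance = is_simultaneous([12.0, 12.25, 12.5], 0.5)
outside_tolerance = is_simultaneous([12.0, 13.0], 0.5)
```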

The invention has been described in the context of content synchronisation in
multi-modal interfaces. The principles behind the invention extend beyond multi-modal
interfaces and may, for example, be used to good effect for the synchronisation of clients
for more than one person, such as two (or more) people in separate locations viewing
the same web page together, when the synchronisation would be of the web browsers of
the two (or more) users.



